JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007






Learning content similarity for music 
recommendation 

Brian McFee, Student member, IEEE, Luke Barrington, Student member, IEEE, and Gert Lanckriet, Member, IEEE 






Abstract — Many tasks in music information retrieval, such 
as recommendation, and playlist generation for online radio, 
fall naturally into the query-by-example setting, wherein a user 
queries the system by providing a song, and the system responds 
with a list of relevant or similar song recommendations. Such 
applications ultimately depend on the notion of similarity between 
items to produce high-quality results. Current state-of-the-art 
systems employ collaborative filter methods to represent musical 
items, effectively comparing items in terms of their constituent 
users. While collaborative filter techniques perform well when 
historical data is available for each item, their reliance on 
historical data impedes performance on novel or unpopular items. 
To combat this problem, practitioners rely on content-based 
similarity, which naturally extends to novel items, but is typically 
out-performed by collaborative filter methods. 

In this article, we propose a method for optimizing content- 
based similarity by learning from a sample of collaborative filter 
data. The optimized content-based similarity metric can then be 
applied to answer queries on novel and unpopular items, while 
still maintaining high recommendation accuracy. The proposed 
system yields accurate and efficient representations of audio 
content, and experimental results show significant improvements 
in accuracy over competing content-based recommendation tech- 
niques. 

Index Terms — Audio retrieval and recommendation, music 
information retrieval, query-by-example, collaborative filters, 
structured prediction. 

EDICS Category: AUD-CONT 



I. Introduction 

AN effective notion of similarity forms the basis of many 
applications involving multimedia data. For example, an 
online music store can benefit greatly from the development 
of an accurate method for automatically assessing similarity 
between two songs, which can in turn facilitate high-quality 
recommendations to a user by finding songs which are similar 
to her previous purchases or preferences. More generally, high- 
quality similarity can benefit any query-by-example recom- 
mendation system, wherein a user presents an example of an 
item that she likes, and the system responds with, e.g., a ranked 
list of recommendations. 

The most successful approach to a wide variety of
recommendation tasks (including not just music, but books,
movies, etc.) is collaborative filtering (CF). Systems based
on collaborative filters exploit the "wisdom of crowds" to

Fig. 1. Query-by-example recommendation engines allow a user to search
for new items by providing an example item. Recommendations are formed 
by computing the most similar items to the query item from a database of 
potential recommendations. 



infer similarities between items, and recommend new items 
to users by representing and comparing these items in terms 
of the people who use them [1]. Within the domain of 
music information retrieval, recent studies have shown that 
CF systems consistently outperform alternative methods for 
playlist generation [2] and semantic annotation [3]. However, 
collaborative filters suffer from the dreaded "cold start" prob- 
lem: a new item cannot be recommended until it has been 
purchased, and it is less likely to be purchased if it is never 
recommended. Thus, only a tiny fraction of songs may be 
recommended, making it difficult for users to explore and 
discover new music [4]. 

The cold-start problem has motivated researchers to im- 
prove content-based recommendation engines. Content-based 
systems operate on music representations that are extracted 
automatically from the audio content, eliminating the need for 
human feedback and annotation when computing similarity. 
While this approach naturally extends to any item regardless 
of popularity, the construction of features and definition of 
similarity in these systems are frequently ad-hoc and not 
explicitly optimized for the specific task. 

In this paper, we propose a method for optimizing content- 
based audio similarity by learning from a sample of collabo- 
rative filter data. Based on this optimized similarity measure, 
recommendations can then be made where no collaborative 
filter data is available. The proposed method treats similarity 
learning as an information retrieval problem, where similarity 
is learned to optimize the ranked list of results in response to 
a query example (Figure 1). Optimizing similarity for rank- 
ing requires more sophisticated machinery than, e.g., genre 
classification for semantic search. However, the information
retrieval approach offers a few key advantages, which we
believe are crucial for realistic music applications. First, there 
are no assumptions of transitivity or symmetry in the proposed 
method. This allows, for example, that "The Beatles" may 
be considered a relevant result for "Oasis", but not vice 
versa. Second, CF data can be collected passively from users 
by mining their listening histories, thereby directly capturing 
their listening habits. Finally, optimizing similarity for ranking 
directly attacks the main quantity of interest: the ordered list 
of retrieved items, rather than coarse abstractions of similarity, 
such as genre agreement. 

A. Related work 

Early studies of musical similarity followed the general 
strategy of first devising a model of audio content (e.g.,
spectral clusters [5] or Gaussian mixture models [6]), applying
some reasonable distance function (e.g., earth-mover's
distance or Kullback-Leibler divergence), and then evaluating
the proposed similarity model against some source of ground 
truth. Logan and Salomon [5] and Aucouturier and Pachet [6] 
evaluated against three notions of similarity between songs: 
same artist, same genre, and human survey data. Artist or 
genre agreement entails strongly binary notions of similarity,
which due to symmetry and transitivity may be unrealistically 
coarse in practice. Survey data can encode subtle relationships 
between items, for example, triplets of the form "A is more 
similar to B than to C" [6]-[8]. However, the expressive 
power of human survey data comes at a cost: while artist or 
genre meta-data is relatively inexpensive to collect for a set of 
songs, similarity survey data may require human feedback on 
a quadratic (for pairwise ratings) or cubic (for triplets) number
of comparisons between songs. 

Later work in musical similarity approaches the problem 
in the context of supervised learning: given a set of training 
items (songs), and some knowledge of similarity across those 
items, the goal is to learn a similarity (distance) function 
that can predict pairwise similarity. Slaney et al. [9] derive 
similarity from web-page co-occurrence, and evaluate several 
supervised and unsupervised algorithms for learning distance 
metrics. McFee and Lanckriet [10] develop a metric learning 
algorithm for triplet comparisons as described above. Our 
proposed method follows in this line of work, but is designed 
to optimize structured ranking loss (not just binary or triplet 
predictions), and uses a collaborative filter as the source of 
ground truth. 

The idea to learn similarity from a collaborative filter 
follows from a series of positive results in music applications. 
Slaney and White [11] demonstrate that an item-similarity 
metric derived from rating data matches human perception 
of similarity better than a content-based method. Similarly, 
it has been demonstrated that when combined with metric 
learning, collaborative filter similarity can be as effective as 
semantic tags for predicting survey data [10]. Kim et al. [3] 
demonstrated that collaborative filter similarity vastly out- 
performs content-based methods for predicting semantic tags. 
Barrington et al. [2] conducted a user survey, and concluded
that the iTunes Genius playlist algorithm (which is at least
partially based on collaborative filters¹) produces playlists of
equal or higher quality than competing methods based on
acoustic content or meta-data.

Finally, there has been some previous work addressing the 
cold-start problem of collaborative filters for music recom- 
mendation by integrating audio content. Yoshii et al. [12] 
formulate a joint probabilistic model of both audio content 
and collaborative filter data in order to predict user ratings 
of songs (using either or both representations), whereas our 
goal here is to use audio data to predict the similarities 
derived from a collaborative filter. Our problem setting is
most similar to that of Stenzel and Kamps [13], wherein a 
CF matrix was derived from playlist data, clustered into latent 
"pseudo-genres," and classifiers were trained to predict the 
cluster membership of songs from audio data. Our proposed 
setting differs in that we derive similarity at the user level 
(not playlist level), and automatically learn the content-based 
song similarity that directly optimizes the primary quantity of 
interest in an information retrieval system: the quality of the 
rankings it induces. 

B. Our contributions 

Our primary contribution in this work is a framework for 
improving content-based audio similarity by learning from a 
sample of collaborative filter data. Toward this end, we first 
develop a method for deriving item similarity from a sample 
of collaborative filter data. We then use the sample similarity 
to train an optimal distance metric over audio descriptors. 
More precisely, a distance metric is optimized to produce 
high-quality rankings of the training sample in a query-by- 
example setting. The resulting distance metric can then be 
applied to previously unseen data for which collaborative filter 
data is unavailable. Experimental results verify that the pro- 
posed methods significantly outperform competing methods 
for content-based music retrieval. 

C. Preliminaries 

For a d-dimensional vector $u \in \mathbb{R}^d$, let $u[i]$ denote its $i$-th
coordinate; similarly, for a matrix $A$, let $A[ij]$ denote its $i$-th
row and $j$-th column entry. A square, symmetric matrix
$A \in \mathbb{R}^{d \times d}$ is positive semi-definite (PSD, denoted $A \succeq 0$)
if each of its eigenvalues is non-negative. For two matrices
A, B of compatible dimension, the Frobenius inner product is
defined as

$$\langle A, B \rangle_F = \operatorname{tr}(A^\top B) = \sum_{i,j} A[ij]\, B[ij].$$

Finally, let $\mathbb{1}[x]$ denote the binary indicator function of the
event x.
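
As a quick numerical check of the Frobenius inner product identity above, the following minimal Python sketch evaluates both sides on a small example. The helper names (`transpose`, `matmul`, `trace`, `frobenius_inner`) and the toy matrices are illustrative assumptions, not part of the original work.

```python
def transpose(A):
    """Transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    """Plain matrix product of two list-of-rows matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def trace(A):
    """Sum of the diagonal entries of a square matrix."""
    return sum(A[i][i] for i in range(len(A)))


def frobenius_inner(A, B):
    """Sum over all entries of A[i][j] * B[i][j]."""
    return sum(a * b for row_a, row_b in zip(A, B) for a, b in zip(row_a, row_b))


# Toy matrices (hypothetical values, chosen only for illustration).
A = [[1.0, 2.0], [3.0, 4.0]]
B = [[0.5, -1.0], [2.0, 0.0]]

lhs = trace(matmul(transpose(A), B))   # tr(A^T B)
rhs = frobenius_inner(A, B)            # sum_{i,j} A[ij] B[ij]
```

Both quantities agree on this example, which is the property used later when scores are written as inner products with W.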

II. Learning similarity 

The main focus of this work is the following information 
retrieval problem: given a query song q, return a ranked list 
from a database $\mathcal{X}$ of $n$ songs ordered by descending similarity
to q. In general, the query may be previously unseen to the
system, but $\mathcal{X}$ will remain fixed across all queries. We will
assume that each song is represented by a vector in $\mathbb{R}^d$, and
similarity is computed by Euclidean distance. Thus, for any
query q, a natural ordering of $x \in \mathcal{X}$ is generated by sorting
according to increasing distance from q: $\|q - x\|$.

¹http://www.apple.com/pr/library/2008/09/09itunes.html

Given some side information describing the similarity re- 
lationships between items of X, distance-based ranking can 
be improved by applying a metric learning algorithm. Rather 
than rely on native Euclidean distance, the learning algorithm 
produces a PSD matrix $W \in \mathbb{R}^{d \times d}$ which characterizes an
optimized distance:

$$\|q - x\|_W = \sqrt{(q - x)^\top W (q - x)}. \tag{1}$$

In order to learn W, we will apply the metric learning to 
rank (MLR) [14] algorithm (Section II-B). At a high level, 
MLR optimizes the distance metric W on X, i.e., so that W 
generates optimal rankings of songs in X when using each 
song in $\mathcal{X}$ as a query. To apply the algorithm, we must provide
a set of similar songs $x \in \mathcal{X}$ for each training query $q \in \mathcal{X}$.
This is achieved by leveraging the side information that is 
available for items in X. More specifically, we will derive a 
notion of similarity from collaborative filter data on X. So, the 
proposed approach optimizes content-based audio similarity 
by learning from a sample of collaborative filter data. 
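
To illustrate how a learned metric changes a distance-based ranking, the following minimal Python sketch ranks a toy database under Equation (1), once with the identity matrix (plain Euclidean distance) and once with a hand-picked PSD matrix. The function names and the values of `X`, `q` and `W` are assumptions made for illustration; in the proposed method W would be learned, not chosen by hand.

```python
import math


def learned_distance(q, x, W):
    """Equation (1): sqrt((q - x)^T W (q - x)) for a PSD matrix W."""
    d = [qi - xi for qi, xi in zip(q, x)]
    Wd = [sum(w * dj for w, dj in zip(row, d)) for row in W]
    # max(0, .) guards against tiny negative values caused by rounding.
    return math.sqrt(max(0.0, sum(di * wdi for di, wdi in zip(d, Wd))))


def rank_database(q, X, W):
    """Indices of the items in X, sorted by increasing distance from q."""
    return sorted(range(len(X)), key=lambda i: learned_distance(q, X[i], W))


# Toy database of three 2-dimensional items and a query at the origin.
X = [[0.0, 2.0], [3.0, 0.0], [1.0, 1.0]]
q = [0.0, 0.0]

identity = [[1.0, 0.0], [0.0, 1.0]]   # native Euclidean distance
W = [[0.1, 0.0], [0.0, 4.0]]          # hypothetical metric: shrinks dim 0, stretches dim 1

euclidean_order = rank_database(q, X, identity)
learned_order = rank_database(q, X, W)
```

On this toy data the Euclidean ranking is `[2, 0, 1]`, while the hand-picked metric yields `[1, 2, 0]`, showing that W alone determines which items are returned first.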



A. Collaborative filters

The term collaborative filter (CF) is generally used to
denote a wide variety of techniques for modeling the
interactions between a set of items and a set of users [1],
[15]. Often, these interactions are modeled as a (typically
sparse) matrix F where rows represent the users, and columns
represent the items. The entry F[ij] encodes the interaction
between user i and item j.

The majority of work in the CF literature deals with F
derived from explicit user feedback, e.g., 5-star ratings [11],
[12]. While rating data can provide highly accurate
representations of user-item affinity, it also has drawbacks, especially
in the domain of music. First, explicit ratings require active
participation on behalf of users. This may be acceptable for
long-form content such as films, in which the time required
for a user to rate an item is minuscule relative to the time
required to consume it. However, for short-form content (e.g.,
songs), it seems unrealistic to expect a user to rate even a
fraction of the items consumed. Second, the scale of rating
data is often arbitrary, skewed toward the extremes (e.g., 1-
and 5-star ratings), and may require careful calibration to use
effectively [11].

Alternatively, CF data can also be derived from implicit
feedback. While somewhat noisier on a per-user basis than
explicit feedback, implicit feedback can be derived in much
higher volumes by simply counting how often a user interacts
with an item (e.g., listens to an artist) [16], [17]. Implicit
feedback differs from rating data, in that it is positive and
unbounded, and it does not facilitate explicit negative feedback.
As suggested by Hu et al. [17], binarizing an implicit feedback
matrix by thresholding can provide an effective mechanism to
infer positive associations.

In a binary CF matrix F, each column $F[\cdot j]$ can be
interpreted as a bag-of-users representation of item j. Of
central interest in this paper is the similarity between items
(i.e., columns of F). We define the similarity between two
items i, j as the Jaccard index [18] of their user sets:

$$S(i, j) = \frac{|F[\cdot i] \cap F[\cdot j]|}{|F[\cdot i] \cup F[\cdot j]|} = \frac{F[\cdot i]^\top F[\cdot j]}{|F[\cdot i]| + |F[\cdot j]| - F[\cdot i]^\top F[\cdot j]}, \tag{2}$$

which counts the number of users shared between i and j,
and normalizes by the total number of users for i or j.

Equation (2) defines a quantitative metric of similarity
between two items. However, for information retrieval
applications, we are primarily interested in the most similar (relevant)
items for any query. We therefore define the relevant set $\mathcal{X}_q^+$
for any item q as the top k most similar items according to
Equation (2), i.e., those items which a user of the system
would be shown first. Although binarizing similarity in this
way does simplify the notion of relevance, it still provides
a flexible language for encoding relationships between items.
Note that after thresholding, transitivity and symmetry are not
enforced, so it is possible, e.g., for The Beatles to be relevant
for Oasis but not vice versa. Consequently, we will need a
learning algorithm which can support such flexible encodings
of relevance.
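
The steps of this section (thresholding implicit feedback, Jaccard similarity between items, and top-k relevant sets) can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the function names, the toy play-count matrix and the choice k = 1 are hypothetical, and a dense list-of-lists stands in for what would be a large sparse matrix in practice.

```python
def binarize(counts, threshold):
    """Threshold a users-by-items count matrix into a 0/1 matrix."""
    return [[1 if c >= threshold else 0 for c in row] for row in counts]


def user_sets(F):
    """Column j of F, represented as the set of users associated with item j."""
    n_items = len(F[0])
    return [{u for u, row in enumerate(F) if row[j]} for j in range(n_items)]


def jaccard(a, b):
    """Jaccard index |a & b| / |a | b|; taken to be 0 when both sets are empty."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def relevant_set(q, sets, k):
    """The k items most similar to item q, excluding q itself."""
    others = [j for j in range(len(sets)) if j != q]
    others.sort(key=lambda j: jaccard(sets[q], sets[j]), reverse=True)
    return others[:k]


# Hypothetical play counts: 4 users (rows) by 3 items (columns).
counts = [
    [12, 15, 0],
    [30, 11, 2],
    [0, 25, 40],
    [10, 0, 0],
]
F = binarize(counts, threshold=10)
sets = user_sets(F)

rel_for_item_1 = relevant_set(1, sets, k=1)
rel_for_item_2 = relevant_set(2, sets, k=1)
```

Here item 1 is the single most relevant item for item 2, while the most relevant item for item 1 is item 0, so relevance after top-k thresholding is not symmetric even though the Jaccard index itself is.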



B. Metric learning to rank 

Any query-by-example retrieval system must have at its core 
a mechanism for comparing the query to a known database, 
i.e., assessing similarity (or distance). Intuitively, the overall 
system should yield better results if the underlying similarity 
mechanism is optimized according to the chosen task. In 
classification tasks, for example, this general idea has led to 
a family of algorithms collectively known as metric learning, 
in which a feature space is optimized (typically by a linear 
transformation) to improve performance of nearest-neighbor 
classification [19]-[21]. While metric learning algorithms have 
been demonstrated to yield substantial improvements in clas- 
sification performance, nearly all of them are fundamentally 
limited to classification, and do not readily generalize to 
asymmetric and non-transitive notions of similarity or rele- 
vance. Moreover, the objective functions optimized by most 
metric learning algorithms do not clearly relate to ranking 
performance, which is of fundamental interest in information 
retrieval applications. 

Rankings, being inherently combinatorial objects, can be 
notoriously difficult to optimize. Performance measures of 
rankings, e.g., area under the ROC curve (AUC) [22], are 
typically non-differentiable, discontinuous functions of the 
underlying parameters, so standard numerical optimization 
techniques cannot be directly applied. However, in recent 
years, algorithms based on the structural SVM [23] have been 
developed which can efficiently optimize a variety of ranking 
performance measures [24]-[26]. While these algorithms sup- 
port general notions of relevance, they do not directly exploit 
the structure of query-by-example retrieval problems. 

The metric learning to rank (MLR) algorithm combines 
these two approaches of metric learning and structural SVM,

Fig. 2. Left: a query point * and its relevant (+) and irrelevant (-) results;
ranking by distance from * results in poor retrieval performance. Right: after 
learning an optimal distance metric with MLR, relevant results are ranked 
higher than irrelevant results. 



and is designed specifically for the query-by-example set- 
ting [14]. MLR learns a positive semi-definite matrix W such 
that rankings induced by learned distances (Equation (1)) are 
optimized according to a ranking loss measure, e.g., AUC, 
mean reciprocal rank (MRR) [27], or normalized discounted
cumulative gain (NDCG) [28]. In this setting, "relevant" 
results should lie close in space to the query q, and "irrelevant" 
results should be pushed far away. 
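
To make the notion of a ranking loss concrete, the following minimal Python sketch computes AUC for a ranked list with binary relevance, together with the decrease in AUC incurred by predicting one ranking instead of a correct one. The names `auc` and `auc_loss`, the toy rankings, and the convention of returning 1.0 when either set is empty are illustrative assumptions and are not taken from the MLR implementation.

```python
def auc(ranking, relevant):
    """Fraction of (relevant, irrelevant) pairs in which the relevant item
    appears earlier in the ranking."""
    position = {item: rank for rank, item in enumerate(ranking)}
    rel = [i for i in ranking if i in relevant]
    irr = [j for j in ranking if j not in relevant]
    if not rel or not irr:
        return 1.0  # assumed convention: nothing can be mis-ordered
    correct = sum(1 for i in rel for j in irr if position[i] < position[j])
    return correct / (len(rel) * len(irr))


def auc_loss(correct_ranking, ranking, relevant):
    """Decrease in AUC from predicting `ranking` instead of `correct_ranking`."""
    return auc(correct_ranking, relevant) - auc(ranking, relevant)


# Hypothetical example with two relevant items out of four.
relevant = {"a", "b"}
y_correct = ["a", "b", "c", "d"]   # all relevant items first
y_other = ["a", "c", "b", "d"]     # one relevant/irrelevant pair is swapped

loss = auc_loss(y_correct, y_other, relevant)
```

One of the four (relevant, irrelevant) pairs is mis-ordered in `y_other`, so its AUC is 0.75 and the loss relative to the correct ranking is 0.25.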

For a query song q, the database $\mathcal{X}$ is ordered by sorting
$x \in \mathcal{X}$ according to increasing distance from q under the
metric defined by W (see Figure 2). The metric W is learned
by solving a constrained convex optimization problem such
that, for each training query q, a higher score is assigned to
a correct ranking $y_q$ than to any other ranking $y \in \mathcal{Y}$ (the set
of all rankings):

$$\forall q: \quad \langle W, \psi(q, y_q) \rangle_F \geq \langle W, \psi(q, y) \rangle_F + \Delta(y_q, y) - \xi_q. \tag{3}$$

Here, the "score" for a query-ranking pair (q, y) is computed
by the Frobenius inner product $\langle W, \psi(q, y) \rangle_F$, where $\psi(q, y)$ is a
matrix-valued feature map which encodes the query-ranking
pair (q, y), and $\Delta(y_q, y)$ computes the loss (e.g., decrease in
AUC) incurred by predicting y instead of $y_q$ for the query q,
essentially playing the role of the "margin" between rankings
$y_q$ and y. Intuitively, the score for a correct ranking $y_q$ should
exceed the score for any other y by at least the loss $\Delta(y_q, y)$.
In the present context, a correct ranking is any one which
places all relevant results $\mathcal{X}_q^+$ before all irrelevant results $\mathcal{X}_q^-$.
To allow violations of margins during training, a slack variable
$\xi_q \geq 0$ is introduced for each query.

Having defined the margin constraints (Equation (3)), what
remains to be specified, to learn W, is the feature map $\psi$ and
the objective function of the optimization. To define the feature
map $\psi$, we first observe that the margin constraints indicate
that, for a query q, the predicted ranking y should be that
which maximizes the score $\langle W, \psi(q, y) \rangle_F$. Consequently, the
(matrix-valued) feature map $\psi(q, y)$ must be chosen so that
the score maximization coincides with the distance-ranking
induced by W, which is, after all, the prediction rule we
propose to use in practice for query-by-example recommendation
(Equation (1)). To accomplish this, MLR encodes query-ranking
pairs (q, y) by the partial order feature [24]:

$$\psi(q, y) = \sum_{i \in \mathcal{X}_q^+} \sum_{j \in \mathcal{X}_q^-} y_{ij} \frac{\phi(q, i) - \phi(q, j)}{|\mathcal{X}_q^+| \cdot |\mathcal{X}_q^-|}, \tag{4}$$

where $\mathcal{X}_q^+$ ($\mathcal{X}_q^-$) is the set of relevant (irrelevant) songs for q,
the ranking y is encoded by

$$y_{ij} = \begin{cases} +1 & i \text{ before } j \text{ in } y \\ -1 & i \text{ after } j \end{cases}$$

and $\phi(q, i)$ is an auxiliary (matrix-valued) feature map that
encodes the relationship between the query q and an individual
result i. Intuitively, $\psi(q, y)$ decomposes the ranking y into
pairs $(i, j) \in \mathcal{X}_q^+ \times \mathcal{X}_q^-$, and computes a signed average of
pairwise differences $\phi(q, i) - \phi(q, j)$. If y places i before j
(i.e., correctly orders i and j), the difference $\phi(q, i) - \phi(q, j)$
is added to $\psi(q, y)$, and otherwise it is subtracted. Note that
under this definition of $\psi$, any two correct rankings $y_q$, $y'_q$
have the same feature representation: $\psi(q, y_q) = \psi(q, y'_q)$. It
therefore suffices to only encode a single correct ranking $y_q$
for each query q to construct margin constraints (Equation (3))
during optimization.

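Since $\psi$ is linear in $\phi$, the score also decomposes into a
signed average across pairs:

$$\langle W, \psi(q, y) \rangle_F = \sum_{i \in \mathcal{X}_q^+} \sum_{j \in \mathcal{X}_q^-} y_{ij} \frac{\langle W, \phi(q, i) \rangle_F - \langle W, \phi(q, j) \rangle_F}{|\mathcal{X}_q^+| \cdot |\mathcal{X}_q^-|}. \tag{5}$$

This indicates that the score $\langle W, \psi(q, y_q) \rangle_F$ for a correct
ranking $y_q$ (the left-hand side of Equation (3)) will be larger
when the point-wise score $\langle W, \phi(q, \cdot) \rangle_F$ is high for relevant
points i, and low for irrelevant points j, i.e.,

$$\forall i \in \mathcal{X}_q^+,\ j \in \mathcal{X}_q^-: \quad \langle W, \phi(q, i) \rangle_F > \langle W, \phi(q, j) \rangle_F. \tag{6}$$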



Indeed, this will accumulate only positive terms in the score
computation in Equation (5), since a correct ranking orders all
relevant results i before all irrelevant results j and, thus, each
$y_{ij}$ in the summation will be positive. Similarly, for incorrect
rankings y, point-wise scores satisfying Equation (6) will lead
to smaller scores $\langle W, \psi(q, y) \rangle_F$. Ideally, after training, W is
maximally aligned to correct rankings $y_q$ (i.e., $\langle W, \psi(q, y_q) \rangle_F$
achieves large margin over scores $\langle W, \psi(q, y) \rangle_F$ for incorrect
rankings) by (approximately) satisfying Equation (6). Consequently,
at test time (i.e., in the absence of a correct ranking
$y_q$), the ranking for a query q is predicted by sorting $i \in \mathcal{X}$
in descending order of point-wise score $\langle W, \phi(q, i) \rangle_F$ [24].
This motivates the choice of $\phi$ used by MLR:

$$\phi(q, i) = -(q - i)(q - i)^\top, \tag{7}$$



which upon taking an inner product with W, yields the
negative, squared distance between q and i under W:

$$\begin{aligned} \langle W, \phi(q, i) \rangle_F &= -\operatorname{tr}\left( W (q - i)(q - i)^\top \right) \\ &= -(q - i)^\top W (q - i) \\ &= -\|q - i\|_W^2. \end{aligned} \tag{8}$$

Descending point-wise score $\langle W, \phi(q, i) \rangle_F$ therefore corresponds
to increasing distance from q. As a result, the ranking
predicted by descending score is equivalent to that predicted
by increasing distance from q, which is precisely the ranking
of interest for query-by-example recommendation.
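
The relationship between point-wise scores, ranking scores and distances can be checked numerically. The following Python sketch is a minimal illustration under stated assumptions: it implements the feature map of Equation (7) and the pairwise decomposition of Equation (5) with plain lists, and the function names, the toy data and the use of the identity matrix for W are all hypothetical.

```python
def phi(q, x):
    """Equation (7): phi(q, x) = -(q - x)(q - x)^T, as a list of rows."""
    d = [a - b for a, b in zip(q, x)]
    return [[-di * dj for dj in d] for di in d]


def frobenius_inner(A, B):
    """Sum of element-wise products of two equally shaped matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))


def ranking_score(W, q, ranking, relevant, X):
    """Equation (5): signed average of point-wise score differences over all
    (relevant, irrelevant) pairs; `ranking` lists item indices, best first."""
    position = {item: r for r, item in enumerate(ranking)}
    point = {i: frobenius_inner(W, phi(q, X[i])) for i in ranking}
    rel = [i for i in ranking if i in relevant]
    irr = [j for j in ranking if j not in relevant]
    total = 0.0
    for i in rel:
        for j in irr:
            y_ij = 1.0 if position[i] < position[j] else -1.0
            total += y_ij * (point[i] - point[j])
    return total / (len(rel) * len(irr))


# Toy data (hypothetical): query at the origin; items 0 and 1 are relevant.
q = [0.0, 0.0]
X = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0], [0.0, 2.0]]
relevant = {0, 1}
W = [[1.0, 0.0], [0.0, 1.0]]  # identity, i.e. plain Euclidean distance

# Point-wise scores equal the negative squared distances of Equation (8).
point_scores = [frobenius_inner(W, phi(q, x)) for x in X]
by_score = sorted(range(len(X)), key=lambda i: -point_scores[i])

score_correct = ranking_score(W, q, by_score, relevant, X)
score_wrong = ranking_score(W, q, [2, 0, 1, 3], relevant, X)
```

In this example the point-wise scores are -1, -1, -9 and -4, sorting by descending score places both relevant items first, and that ranking receives a higher score (5.5) than a ranking that puts an irrelevant item on top (-2.5).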

The MLR optimization problem is listed as Algorithm 1.
As in support vector machines [29], the objective consists of

Algorithm 1 Metric learning to rank [14]

Input: data $\mathcal{X} = \{q_1, q_2, \ldots, q_n\} \subset \mathbb{R}^d$,
correct rankings $\{y_q : q \in \mathcal{X}\}$,
slack trade-off $C > 0$

Output: $d \times d$ matrix $W \succeq 0$

$$\min_{W \succeq 0,\ \xi} \quad \operatorname{tr}(W) + C \cdot \frac{1}{n} \sum_{q \in \mathcal{X}} \xi_q$$

$$\text{s.t.} \quad \forall q \in \mathcal{X},\ \forall y \in \mathcal{Y}: \quad \langle W, \psi(q, y_q) \rangle_F \geq \langle W, \psi(q, y) \rangle_F + \Delta(y_q, y) - \xi_q$$

Fig. 3. Two close data points $x_1$, $x_2$ (+) and the Voronoi partition for three
VQ codewords $v_1$, $v_2$, $v_3$ (♦). Left: hard VQ ($\tau = 1$) assigns similar data
points to dissimilar histograms. Right: assigning each data point to its top
$\tau = 2$ codewords reduces noise in codeword histogram representations.



two competing terms: a regularization term $\operatorname{tr}(W)$, which is
a convex approximation to the rank of the learned metric, and
$\frac{1}{n} \sum_q \xi_q$, which provides a convex upper bound on the empirical
training loss $\Delta$. The two terms are balanced by a trade-off
parameter C. Although the full problem includes a super-exponential
number of constraints (one for each $y \in \mathcal{Y}$, for
each q), it can be approximated by cutting plane optimization
techniques [14], [30].

III. Audio representation 

In order to compactly summarize audio signals, we rep- 
resent each song as a histogram over a dictionary of tim- 
bral codewords. This general strategy has been successful 
in computer vision applications [31], as well as audio and 
music classification [32]-[34]. As a first step, a codebook is 
constructed by clustering a large collection of feature descrip- 
tors (Section III-A). Once the codebook has been constructed, 
each song is summarized by aggregating vector quantization 
(VQ) representations across all frames in the song, resulting 
in codeword histograms (Section III-B). Finally, histograms 
are represented in a non-linear kernel space to facilitate better 
learning with MLR (Section III-C). 

A. Codebook training 

Our general approach to constructing a codebook for vector 
quantization is to aggregate audio feature descriptors from a 
large pool of songs into a single bag-of-features, which is then 
clustered to produce the codebook. 

For each song x in the codebook training set $\mathcal{X}_C$ (which
may generally be distinct from the MLR training set $\mathcal{X}$),
we compute the first 13 Mel frequency cepstral coefficients
(MFCCs) [35] from each half-overlapping 23ms frame. From
the time series of MFCC vectors, we compute the first and second
instantaneous derivatives, which are concatenated with the
MFCCs to form a sequence of 39-dimensional dynamic MFCC ($\Delta$MFCC)
vectors [36]. These descriptors are then aggregated across all
$x \in \mathcal{X}_C$ to form an unordered bag of features Z.

To correct for changes in scale across different $\Delta$MFCC
dimensions, each vector $z \in Z$ is normalized according to
the sample mean $\mu \in \mathbb{R}^{39}$ and standard deviation $\sigma \in \mathbb{R}^{39}$
estimated from Z. The $i$-th coordinate $z[i]$ is mapped by

$$z[i] \mapsto \frac{z[i] - \mu[i]}{\sigma[i]}. \tag{9}$$

The normalized $\Delta$MFCC vectors are then clustered into a set
$\mathcal{V}$ of $|\mathcal{V}|$ codewords by k-means (specifically, an online variant
of Hartigan's method [37]).
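
The normalization and clustering steps can be sketched as follows. This is a minimal Python illustration under several assumptions: the standard deviation is computed with a 1/n normalizer, every dimension is assumed to have non-zero variance, and the single-pass running-mean update in `online_kmeans` is a simple stand-in that is not claimed to match the online variant of Hartigan's method cited above. All function names and the toy data are hypothetical.

```python
import math


def fit_normalizer(Z):
    """Per-dimension mean and standard deviation of a bag of features Z."""
    n, dim = len(Z), len(Z[0])
    mu = [sum(z[i] for z in Z) / n for i in range(dim)]
    # Assumption: population (1/n) standard deviation, non-zero in every dimension.
    sigma = [math.sqrt(sum((z[i] - mu[i]) ** 2 for z in Z) / n) for i in range(dim)]
    return mu, sigma


def normalize(z, mu, sigma):
    """Coordinate-wise standardization: (z[i] - mu[i]) / sigma[i]."""
    return [(zi - m) / s for zi, m, s in zip(z, mu, sigma)]


def online_kmeans(points, k):
    """Single pass over the data: centers start at the first k points, and each
    later point pulls its nearest center toward it by a running-mean step."""
    centers = [list(p) for p in points[:k]]
    counts = [1] * k
    for p in points[k:]:
        nearest = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
        )
        counts[nearest] += 1
        step = 1.0 / counts[nearest]
        centers[nearest] = [c + step * (a - c) for a, c in zip(p, centers[nearest])]
    return centers


# Toy 2-dimensional "descriptors" on very different scales (hypothetical values).
Z = [[0.0, 0.0], [2.0, 10.0], [0.0, 10.0], [2.0, 0.0]]
mu, sigma = fit_normalizer(Z)
Z_norm = [normalize(z, mu, sigma) for z in Z]
codebook = online_kmeans(Z_norm, k=2)
```

After normalization both dimensions have zero mean and unit variance, so neither dominates the squared distances used to assign points to cluster centers.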

B. (Top-$\tau$) Vector quantization

Once the codebook $\mathcal{V}$ has been constructed, a song x is
represented as a histogram $h_x$ over the codewords in $\mathcal{V}$. This
proceeds in three steps: 1) a bag-of-features is computed from
x's $\Delta$MFCCs, denoted as $x = \{x_i\} \subset \mathbb{R}^{39}$; 2) each $x_i \in x$
is normalized according to Equation (9); 3) the codeword
histogram is constructed by counting the frequency with which
each codeword $v \in \mathcal{V}$ quantizes an element of x:²

$$h_x[v] = \frac{1}{|x|} \sum_{x_i \in x} \mathbb{1}\left[ v = \operatorname*{argmin}_{u \in \mathcal{V}} \|x_i - u\| \right]. \tag{10}$$

Codeword histograms are normalized by the number of frames
$|x|$ in the song in order to ensure comparability between songs
of different lengths; $h_x$ may therefore be interpreted as a
multinomial distribution over codewords.

Equation (10) derives from the standard notion of vector
quantization (VQ), where each vector (e.g., data point $x_i$)
is replaced by its closest quantizer. However, VQ can become
unstable when a vector has multiple, (approximately)
equidistant quantizers (Figure 3, left), which is more likely to
happen as the size of the codebook increases. To counteract
quantization errors, we generalize Equation (10) to support
multiple quantizers for each vector.

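For a vector $x_i$, a codebook $\mathcal{V}$, and a quantization threshold
$\tau \in \{1, 2, \ldots, |\mathcal{V}|\}$, we define the quantization set

$$\operatorname*{argmin}^{\tau}_{u \in \mathcal{V}} \|x_i - u\| = \{ u \text{ is a } \tau\text{-nearest neighbor of } x_i \}.$$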



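The top-$\tau$ codeword histogram for a song x is then constructed
as

$$h_x^\tau[v] = \frac{1}{|x|} \sum_{x_i \in x} \frac{1}{\tau} \mathbb{1}\left[ v \in \operatorname*{argmin}^{\tau}_{u \in \mathcal{V}} \|x_i - u\| \right]. \tag{11}$$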






Intuitively, Equation (11) assigns $1/\tau$ mass to each of the $\tau$
closest codewords for each $x_i \in x$ (Figure 3, right). Note
that when $\tau = 1$, Equation (11) reduces to Equation (10).
The normalization by $1/\tau$ ensures that $\sum_{v \in \mathcal{V}} h_x^\tau[v] = 1$, so
that for $\tau > 1$, $h_x^\tau$ retains its interpretation as a multinomial
distribution over $\mathcal{V}$.
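
A minimal Python sketch of top-$\tau$ quantization is given below. It mirrors the situation of Figure 3 with two nearly identical one-frame "songs" that fall on opposite sides of a codeword boundary. The function name `top_tau_histogram`, the toy codebook and the frame values are assumptions made for illustration, and ties between equidistant codewords are broken by codeword index.

```python
def top_tau_histogram(frames, codebook, tau=1):
    """Codeword histogram in which each frame gives 1/tau mass to each of its
    tau nearest codewords; tau=1 corresponds to hard vector quantization."""
    hist = [0.0] * len(codebook)
    for x in frames:
        # Indices of the tau codewords nearest to x (squared Euclidean distance).
        nearest = sorted(
            range(len(codebook)),
            key=lambda v: sum((a - b) ** 2 for a, b in zip(x, codebook[v])),
        )[:tau]
        for v in nearest:
            hist[v] += 1.0 / (tau * len(frames))
    return hist


# Hypothetical codebook with three codewords.
codebook = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]]

# Two one-frame songs lying just on either side of the boundary between
# codewords 0 and 1.
song_a = [[0.49, 0.0]]
song_b = [[0.51, 0.0]]

hard_a = top_tau_histogram(song_a, codebook, tau=1)
hard_b = top_tau_histogram(song_b, codebook, tau=1)
soft_a = top_tau_histogram(song_a, codebook, tau=2)
soft_b = top_tau_histogram(song_b, codebook, tau=2)
```

With hard quantization the two songs receive disjoint histograms, whereas with $\tau = 2$ they receive identical ones, and in both cases each histogram sums to one.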

²To simplify notation, we denote by $h_x[v]$ the bin of histogram $h_x$
corresponding to the codeword $v \in \mathcal{V}$. Codewords are assumed to be unique,
and the usage should be clear from context.

C. Histogram representation and distance

After summarizing each song x by a codeword histogram
$h_x^\tau$, these histograms may be interpreted as vectors in $\mathbb{R}^{|\mathcal{V}|}$.
Subsequently, for a query song q, retrieval may be performed
by ordering $x \in \mathcal{X}$ according to increasing (Euclidean)
distance $\|h_q^\tau - h_x^\tau\|$. After optimizing W with Algorithm 1,
the same codeword histogram vectors may be used to perform
retrieval with respect to the learned metric $\|h_q^\tau - h_x^\tau\|_W$.

However, treating codeword histograms directly as vectors
in a Euclidean space ignores the simplicial structure
of multinomial distributions. To better exploit the geometry
of codeword histograms, we represent each histogram in a
probability product kernel (PPK) space [38]. Inner products in
this space can be computed by evaluating the corresponding
kernel function k. For PPK space, k is defined as:

$$k(h_q^\tau, h_x^\tau) = \sum_{v \in \mathcal{V}} \sqrt{h_q^\tau[v] \cdot h_x^\tau[v]}. \tag{12}$$



The PPK inner product in Equation (12) is equivalent to the
Bhattacharyya coefficient [39] between $h_q^\tau$ and $h_x^\tau$. Consequently,
distance in PPK space induces the same rankings as
Hellinger distance between histograms.

Typically in kernel methods, data is represented implicitly
in a (typically high-dimensional) Hilbert space via the
$n \times n$ matrix of inner products between training points, i.e.,
the kernel matrix [40]. This representation enables efficient
learning, even when the dimensionality of the kernel space is
much larger than the number of points (e.g., for
histogram-intersection kernels [41]) or infinite (e.g., radial basis
functions). The MLR algorithm has been extended to support
optimization of distances in such spaces by reformulating the
optimization in terms of the kernel matrix, and optimizing
an $n \times n$ matrix $W \succeq 0$ [42]. While kernel MLR supports
optimization in arbitrary inner product spaces, it can be
difficult to scale up to large training sets (i.e., large n), which
may require some approximations, e.g., by restricting W to be
diagonal.

However, for the present application, we can exploit the
specific structure of the probability product kernel (on histograms)
and optimize distances in PPK space with complexity that
depends on $|\mathcal{V}|$ rather than n, thereby supporting larger training
sets. Note that PPK enables an explicit representation of the
data according to a simple, coordinate-wise transformation:

$$h_x^\tau[v] \mapsto \sqrt{h_x^\tau[v]}, \tag{13}$$

which, since $k(h_x^\tau, h_x^\tau) = 1$ for all $h_x^\tau$, can be interpreted as
mapping the $|\mathcal{V}|$-dimensional simplex to the $|\mathcal{V}|$-dimensional
unit sphere. Training data may therefore be represented as a
$|\mathcal{V}| \times n$ data matrix, rather than the $n \times n$ kernel matrix. As
a result, we can equivalently apply Equation (13) to the data,
and learn a $|\mathcal{V}| \times |\mathcal{V}|$ matrix W with Algorithm 1, which is
more efficient than using kernel MLR when $|\mathcal{V}| < n$, as is
often the case in our experiments.
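
The explicit PPK representation can be verified numerically. The following minimal Python sketch (with hypothetical function names and toy histograms) checks three facts used above: the inner product of square-rooted histograms equals the kernel value, every mapped histogram has unit norm, and the squared Euclidean distance between mapped histograms equals 2 - 2k, so that it decreases monotonically as the Bhattacharyya coefficient k increases.

```python
import math


def ppk_map(h):
    """Coordinate-wise square root of a histogram (explicit PPK feature map)."""
    return [math.sqrt(p) for p in h]


def ppk_kernel(h_q, h_x):
    """Sum over bins of sqrt(h_q[v] * h_x[v]), i.e. the Bhattacharyya coefficient."""
    return sum(math.sqrt(a * b) for a, b in zip(h_q, h_x))


def dot(u, v):
    """Ordinary Euclidean inner product."""
    return sum(a * b for a, b in zip(u, v))


# Two toy histograms over four codewords (hypothetical values, each sums to 1).
h_q = [0.25, 0.25, 0.5, 0.0]
h_x = [0.0, 0.25, 0.25, 0.5]

m_q, m_x = ppk_map(h_q), ppk_map(h_x)
k_value = ppk_kernel(h_q, h_x)
squared_distance = sum((a - b) ** 2 for a, b in zip(m_q, m_x))
```

Because the mapped vectors are explicit and only as long as the codebook, they can be passed directly to a linear metric learner in place of an n-by-n kernel matrix.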

IV. Experiments 

Our experiments are designed to simulate query-by-example 
content-based retrieval of songs from a fixed database.
Figure 4 illustrates the high-level experimental setup: training

Fig. 4. Schematic diagram of training and retrieval. Here, "training data"
encompasses both the subset of $\mathcal{X}$ used to train the metric W, and the
codebook set $\mathcal{X}_C$ used to build the codebook $\mathcal{V}$. While, in our experiments,
both sets are disjoint, in general, data used to build the codebook may also
be used to train the metric.



and evaluation are conducted with respect to collaborative 
filter similarity (as described in Section II-A). In this section, 
we describe the sources of collaborative filter and audio data, 
experimental procedure, and competing methods against which 
we compare. 

A. Data 

1) Collaborative filter: Last.FM: Our collaborative filter 
data is provided by Last.fm³, and was collected by Celma [4,
chapter 3]. The data consists of a users-by-artists matrix F of
359,347 unique users and 186,642 unique, identifiable artists;
the entry F[ij] contains the number of times user i listened
to artist j. We binarize the matrix by thresholding at 10, i.e., 
a user must listen to an artist at least 10 times before we 
consider the association meaningful. 
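
As a small illustration of the thresholding step, the following sketch binarizes a toy play-count matrix. The function name and the counts are hypothetical; a dense array is used only for readability, whereas a matrix of the size reported above would in practice call for a sparse representation.

```python
import numpy as np

def binarize_counts(F, threshold=10):
    """Return a 0/1 matrix marking (user, artist) pairs with at least `threshold` plays."""
    return (np.asarray(F) >= threshold).astype(np.uint8)

# Toy users-by-artists play counts (hypothetical values).
F = np.array([[12,  3,  0],
              [10,  9, 40],
              [ 0, 25,  1]])
B = binarize_counts(F)
```

Note that a count of exactly 10 is retained, matching "at least 10 times".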

2) Audio: CAL10K: For our audio data, we use the
CAL10K data set [43]. Starting from 10,832 songs by 4,661
unique artists, we first partition the set of artists into those 
with at least 100 listeners in the binarized CF matrix (2015, 
the experiment set), and those with fewer than 100 listeners 
(2646, the codebook set). We then restrict the CF matrix to 
just those 2015 artists in the experiment set, with sufficiently 
many listeners. From this restricted CF matrix, we compute 
the artist-by-artist similarity matrix according to Equation (2). 
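
The partition and the similarity computation can be sketched as follows. Equation (2) is defined outside this section; the sketch assumes it is the Jaccard index between the listener sets of two artists, and the helper names and toy matrix are hypothetical.

```python
import numpy as np

def split_artists(B, min_listeners=100):
    """Partition artist (column) indices of the binarized matrix B by listener count."""
    listeners = np.asarray(B).sum(axis=0)
    experiment = np.flatnonzero(listeners >= min_listeners)
    codebook = np.flatnonzero(listeners < min_listeners)
    return experiment, codebook

def jaccard_similarity(B):
    """Artist-by-artist Jaccard index of listener sets (assumed form of Equation (2))."""
    B = np.asarray(B, dtype=float)
    inter = B.T @ B                                # |users(i) ∩ users(j)|
    counts = np.diag(inter)                        # |users(i)|
    union = counts[:, None] + counts[None, :] - inter
    S = np.zeros_like(inter)
    np.divide(inter, union, out=S, where=union > 0)
    return S

# Toy binarized matrix: 4 users x 3 artists; threshold lowered to 2 for illustration.
B = np.array([[1, 1, 0],
              [1, 1, 0],
              [1, 0, 0],
              [0, 1, 1]])
experiment, codebook = split_artists(B, min_listeners=2)
S = jaccard_similarity(B[:, experiment])           # similarity restricted to the experiment set
```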

Artists in the codebook set, with insufficiently many listen- 
ers, are held out from the experiments in Section IV-B, but 
their songs are used to construct four codebooks as described 
in Section III-A. From each held out artist, we randomly 
select one song, and extract a 5-second sequence of ΔMFCC
vectors (431 half-overlapping 23 ms frames at 22050 Hz). These
samples are collected into a bag-of-features of approximately
1.1 million samples, which is randomly permuted, and clustered
via online k-means in a single pass to build four
codebooks of sizes |V| ∈ {256, 512, 1024, 2048}, respectively.

³http://www.last.fm/
TABLE I
Statistics of CAL10K data, averaged across ten random
training/validation/test splits. # Relevant is the average number
of relevant songs for each training/validation/test song.

              Training         Validation       Test
# Artists     806              604              605
# Songs       2122.3 ± 36.3    1589.3 ± 38.6    1607.5 ± 64.3
# Relevant    36.9 ± 16.4      36.4 ± 15.4      37.1 ± 16.0


Cluster centers are initialized to the first (randomly selected) 
k points. Note that only the artists from the codebook set (and 
thus no artists from the experiment set) are used to construct 
the codebooks. As a result, the previous four codebooks are 
fixed throughout the experiments in the following section. 
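
A single-pass online k-means of the kind described here might look like the sketch below. The text specifies only the random permutation, the single pass, and the initialization from the first k points; the running-mean update rule used here is an assumption, and the function name is hypothetical.

```python
import numpy as np

def online_kmeans(samples, k, seed=0):
    """Single-pass online k-means over a randomly permuted bag of features."""
    rng = np.random.default_rng(seed)
    X = np.asarray(samples, dtype=float)
    X = X[rng.permutation(len(X))]        # random permutation of the samples
    centers = X[:k].copy()                # initialize with the first k points
    counts = np.ones(k)
    for x in X[k:]:                       # each remaining sample is visited once
        j = np.argmin(((centers - x) ** 2).sum(axis=1))
        counts[j] += 1
        centers[j] += (x - centers[j]) / counts[j]   # running mean (assumed update rule)
    return centers

# Toy bag of features: two well-separated groups in 2-D (hypothetical data).
rng = np.random.default_rng(1)
samples = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
                     rng.normal(10.0, 0.1, size=(50, 2))])
centers = online_kmeans(samples, k=2)
```

Since each center is a running mean of the samples assigned to it, memory use is independent of the number of samples, which is what makes a single pass over roughly a million frames practical.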

B. Procedure 

For our experiments, we generate 10 random splits of
the experiment set of 2015 artists into 40% training, 30%
validation and 30% test artists⁴. For each split, the set of all
training artist songs forms the training set, which serves as
the database of "known" songs, X. For each split, and for
each (training/test/validation) artist, we then define the relevant
artist set as the top 10 most similar training⁵ artists. Finally,
for any song q by artist i, we define q's relevant song set,
X_q^+, as all songs by all artists in i's relevant artist set. The
songs by all other training artists, not in i's relevant artist set,
are collected into X_q^-, the set of irrelevant songs for q. The
statistics of the training, validation, and test splits are collected
in Table I.
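
The construction of relevant and irrelevant song sets can be sketched as below. The helper names and toy data are hypothetical, and excluding the query artist from its own relevant set is an assumption that the text does not state explicitly.

```python
import numpy as np

def relevant_artists(S, train_artists, query_artist, k=10):
    """Top-k training artists most similar to `query_artist` under similarity matrix S."""
    candidates = [a for a in train_artists if a != query_artist]  # assumption: skip self
    candidates.sort(key=lambda a: S[query_artist, a], reverse=True)
    return candidates[:k]

def relevant_songs(song_artist, train_artists, rel_artists):
    """Split training songs into relevant (X_q^+) and irrelevant (X_q^-) index lists."""
    train, rel = set(train_artists), set(rel_artists)
    pos = [s for s, a in enumerate(song_artist) if a in rel]
    neg = [s for s, a in enumerate(song_artist) if a in train and a not in rel]
    return pos, neg

# Toy example: artists 0-2 are training artists, artist 3 is a test artist.
S = np.array([[1.0, 0.2, 0.4, 0.1],
              [0.2, 1.0, 0.3, 0.5],
              [0.4, 0.3, 1.0, 0.3],
              [0.1, 0.5, 0.3, 1.0]])
song_artist = [0, 0, 1, 2, 2, 3]          # song index -> artist index
rel = relevant_artists(S, train_artists=[0, 1, 2], query_artist=3, k=2)
pos, neg = relevant_songs(song_artist, [0, 1, 2], rel)
```

Every song by the same query artist shares the same relevant set, so the sets need to be computed once per artist rather than once per song.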

For each of the four codebooks, constructed in the pre- 
vious section, each song was represented by a histogram 
over codewords using Equation (11), with τ ∈ {1, 2, 4, 8}.
Codeword histograms were then mapped into PPK space by 
Equation (13). For comparison purposes, we also experiment 
with Euclidean distance and MLR on the raw codeword 
histograms. 
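
For concreteness, a top-τ codeword histogram could be computed as in the sketch below. Equation (11) is defined outside this section; the sketch assumes that each frame votes equally for its τ nearest codewords and that the histogram is normalized to sum to 1. The function name and toy values are hypothetical.

```python
import numpy as np

def top_tau_histogram(frames, codebook, tau=1):
    """Histogram over codewords in which each frame votes for its tau nearest codewords."""
    frames = np.asarray(frames, dtype=float)
    codebook = np.asarray(codebook, dtype=float)
    # Squared Euclidean distance from every frame to every codeword: (n_frames, |V|).
    d2 = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=2)
    nearest = np.argsort(d2, axis=1)[:, :tau]
    h = np.zeros(len(codebook))
    np.add.at(h, nearest.ravel(), 1.0)    # accumulate votes, repeated indices included
    return h / (len(frames) * tau)        # normalize so the histogram sums to 1

# Toy 1-D codebook with 4 codewords and a "song" of two frames.
codebook = np.array([[0.0], [1.0], [2.0], [10.0]])
frames = np.array([[0.1], [1.9]])
h1 = top_tau_histogram(frames, codebook, tau=1)
h2 = top_tau_histogram(frames, codebook, tau=2)
```

With τ = 2 both frames also vote for the middle codeword, so the histogram becomes smoother than with τ = 1.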

To train the distance metric with Algorithm 1, we vary C over a
logarithmic grid of powers of 10, from negative to positive
exponents. We experiment with three ranking
losses Δ for training: area under the ROC curve (AUC), which
captures global qualities of the ranking, but penalizes mistakes
equally regardless of their position in the ranking; normalized
discounted cumulative gain (NDCG), which applies larger
penalties to mistakes at the beginning of the ranking than at
the end, and is therefore more localized than AUC; and mean
reciprocal rank (MRR), which is determined by the position
of the first relevant result, and is therefore the most localized
ranking loss under consideration here. After learning W on the
training set, retrieval is evaluated on the validation set, and the
parameter setting (C, Δ) which achieves highest AUC on the
validation set is then evaluated on the test set.
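
The three ranking measures can be made concrete with a short sketch operating on a ranked list of binary relevance labels (first entry = top of the ranking). These are the standard score definitions, written under the assumption of at least one relevant and one irrelevant item and an untruncated NDCG; the corresponding training loss would be one minus the score. Function names are hypothetical.

```python
import numpy as np

def auc(rel):
    """Fraction of (relevant, irrelevant) pairs ordered correctly."""
    rel = np.asarray(rel, dtype=bool)
    n_pos, n_neg = rel.sum(), (~rel).sum()
    # For each position, the number of irrelevant items at or below it.
    neg_below = np.cumsum((~rel)[::-1])[::-1]
    return neg_below[rel].sum() / (n_pos * n_neg)

def mrr(rel):
    """Reciprocal rank of the first relevant result."""
    rel = np.asarray(rel, dtype=bool)
    return 1.0 / (np.argmax(rel) + 1)

def ndcg(rel):
    """Normalized discounted cumulative gain with logarithmic position discount."""
    rel = np.asarray(rel, dtype=float)
    discounts = 1.0 / np.log2(np.arange(2, len(rel) + 2))
    ideal = (np.sort(rel)[::-1] * discounts).sum()
    return (rel * discounts).sum() / ideal

ranking = [0, 1, 1, 0, 0]     # an irrelevant song is ranked first
scores = (auc(ranking), mrr(ranking), ndcg(ranking))
```

On this example, the single mistake at the top of the list costs MRR half of its value, while AUC drops by only a third, illustrating the difference between localized and global measures.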

To evaluate a metric W, the training set X is ranked 
according to distance from each test (validation) song q under 

⁴Due to recording effects and our definition of similarity, it is crucial to
split at the level of artists rather than songs [44].

⁵Also for test and validation artists, we restrict the relevant artist set to the
training artists to mimic the realistic setting of retrieving "known" songs from
X, given an "unknown" (test/validation) query.
W, and we record the mean AUC of the rankings over all test
(validation) songs.
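
Ranking the training set under a learned metric can be sketched as follows, assuming the usual form of the learned distance, (x − q)ᵀ W (x − q), with W positive semidefinite. The function name and toy values are hypothetical.

```python
import numpy as np

def rank_by_metric(W, X, q):
    """Indices of the rows of X sorted by increasing (x - q)^T W (x - q)."""
    D = np.asarray(X, dtype=float) - np.asarray(q, dtype=float)
    d = np.einsum("ij,jk,ik->i", D, W, D)   # one quadratic form per row
    return np.argsort(d, kind="stable")

# Toy database of three songs in a 2-D feature space and a query at the origin.
X = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 3.0]])
q = np.zeros(2)
euclidean_order = rank_by_metric(np.eye(2), X, q)            # W = I: Euclidean ranking
learned_order = rank_by_metric(np.diag([10.0, 0.1]), X, q)   # reweighted dimensions
```

Setting W to the identity recovers the native Euclidean ranking, so the same routine serves both the baseline and the optimized metric.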

Prior to training with MLR, codeword histograms are com- 
pressed via principal components analysis (PCA) to capture 
95% of the variance as estimated on the training set. While 
primarily done for computational efficiency, this step is similar 
to the latent perceptual indexing method described by Sun- 
daram and Narayanan [32], and may also be interpreted as 
de-noising the codeword histogram representations. In prelim- 
inary experiments, compression of codeword histograms was 
not observed to significantly affect retrieval accuracy in either 
the native or PPK spaces (without MLR optimization). 
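
The compression step can be sketched with a plain SVD-based PCA. Function names and the synthetic data are hypothetical; rows are songs and columns are histogram dimensions.

```python
import numpy as np

def fit_pca(X_train, var_fraction=0.95):
    """Fit PCA on the training set, keeping the fewest components that reach var_fraction."""
    mean = X_train.mean(axis=0)
    _, s, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
    explained = np.cumsum(s ** 2) / np.sum(s ** 2)
    d = min(int(np.searchsorted(explained, var_fraction)) + 1, len(s))
    return mean, Vt[:d]

def apply_pca(X, mean, components):
    """Project data (training, validation, or test) onto the retained components."""
    return (X - mean) @ components.T

# Synthetic data: two high-variance directions and three low-variance ones.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 5)) * np.array([10.0, 5.0, 0.1, 0.1, 0.1])
mean, components = fit_pca(X_train)
Z = apply_pca(X_train, mean, components)
```

The mean and components are estimated on the training set only and then reused for validation and test songs, consistent with the variance being "estimated on the training set".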

C. Comparisons 

To evaluate the performance of the proposed system, we
compare to several alternative methods for content-based
query-by-example song retrieval: first, similarity derived from
comparing Gaussian mixture models of ΔMFCCs; second, an
alternative (unsupervised) weighting of VQ codewords; and
third, a high-level, automatic semantic annotation method. We
also include a comparison to a manual semantic annotation
method (i.e., driven by human experts), which, although not
content-based, can provide an estimate of an upper bound on
what can be achieved by content-based methods. For both
manual and automatic semantic annotations, we will also
compare to their MLR-optimized counterparts.

1) Gaussian mixtures: From each song, a Gaussian mixture
model (GMM) over its ΔMFCCs was estimated via
expectation-maximization [45]. Each GMM consists of 8
components with diagonal covariance. The training set X is
therefore represented as a collection of GMM distributions
{p_x : x ∈ X}. This approach is fairly standard in music
information retrieval [6], [8], [46], and is intended to serve
as a baseline against which we can compare the proposed VQ
approach.

At test time, given a query song q, we first estimate its
GMM p_q. We would then like to rank each x ∈ X by
increasing Kullback-Leibler (KL) divergence [47] from p_q:

$$D(p_q \,\|\, p_x) = \int p_q(z) \log \frac{p_q(z)}{p_x(z)} \, dz. \qquad (14)$$

However, we do not have a closed-form expression for KL divergence
between GMMs, so we must resort to approximate
methods. Several such approximation schemes have been
devised in recent years, including variational methods and
sampling approaches [48]. Here, we opt for the Monte Carlo
approximation:

$$D(p_q \,\|\, p_x) \approx \frac{1}{m} \sum_{i=1}^{m} \log \frac{p_q(z_i)}{p_x(z_i)}, \qquad (15)$$

where $\{z_i\}_{i=1}^{m}$ is a collection of m independent samples
drawn from p_q. Although the Monte Carlo approximation is
considerably slower than closed-form approximations (e.g.,
variational methods), with enough samples, it often exhibits
higher accuracy [46], [48]. Note that because we are only
interested in the rank-ordering of X given p_q, it is equivalent to
order each x ∈ X by increasing (approximate) cross-entropy:

$$H(p_q, p_x) = \int p_q(z) \log \frac{1}{p_x(z)} \, dz \approx \frac{1}{m} \sum_{i=1}^{m} \log \frac{1}{p_x(z_i)}. \qquad (16)$$

For efficiency purposes, for each query q we fix the sample
$\{z_i\}_{i=1}^{m} \sim p_q$ across all x ∈ X. We use m = 2048 samples
for each query, which was found to yield stable cross-entropy
estimates in an informal, preliminary experiment.
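
The Monte Carlo cross-entropy estimate can be sketched for diagonal-covariance GMMs as below. A GMM is represented here by a hypothetical `(weights, means, variances)` tuple, and the single-component models are toy stand-ins for the 8-component models used in the experiments.

```python
import numpy as np

def gmm_logpdf(Z, weights, means, variances):
    """Log-density of a diagonal-covariance GMM evaluated at each row of Z."""
    diff = np.asarray(Z, dtype=float)[:, None, :] - means[None, :, :]   # (m, K, d)
    log_comp = -0.5 * (np.sum(diff ** 2 / variances, axis=2)
                       + np.sum(np.log(2 * np.pi * variances), axis=1))
    a = log_comp + np.log(weights)                                      # (m, K)
    amax = a.max(axis=1, keepdims=True)                                 # log-sum-exp trick
    return (amax + np.log(np.exp(a - amax).sum(axis=1, keepdims=True))).ravel()

def gmm_sample(m, weights, means, variances, rng):
    """Draw m independent samples from the GMM."""
    comp = rng.choice(len(weights), size=m, p=weights)
    return means[comp] + rng.normal(size=(m, means.shape[1])) * np.sqrt(variances[comp])

def cross_entropy_mc(Z, gmm_x):
    """Monte Carlo estimate of H(p_q, p_x) from samples Z drawn from p_q."""
    return -np.mean(gmm_logpdf(Z, *gmm_x))

rng = np.random.default_rng(0)
p_q = (np.array([1.0]), np.zeros((1, 2)), np.ones((1, 2)))        # query model
p_far = (np.array([1.0]), np.full((1, 2), 5.0), np.ones((1, 2)))  # a distant song model
Z = gmm_sample(2048, *p_q, rng)            # sampled once per query, reused for every x
H_near = cross_entropy_mc(Z, p_q)
H_far = cross_entropy_mc(Z, p_far)
```

Reusing the same sample `Z` for every database model means the query-dependent term of the KL divergence never has to be evaluated, which is why ranking by cross-entropy is equivalent and cheaper.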

2) TF-IDF: The algorithm described in Section II-B is 
a supervised approach to learning an optimal transformation 
of feature descriptors (in this specific case, VQ histograms). 
Alternatively, it is common to use the natural statistics of 
the data in an unsupervised fashion to transform the feature 
descriptors. As a baseline, we compare to the standard method 
of combining term frequency-inverse document frequency (TF- 
IDF) [49] representations with cosine similarity, which is 
commonly used with both text [49] and codeword representa- 
tions [50]. 

Given a codeword histogram $h_q^\tau$, for each v ∈ V, $h_q^\tau[v]$ is
mapped to its TF-IDF value by⁶

$$h_q^\tau[v] \mapsto h_q^\tau[v] \cdot \mathrm{IDF}[v], \qquad (17)$$

where IDF[v] is computed from the statistics of the training
set by⁷

$$\mathrm{IDF}[v] = \log \frac{|X|}{|\{x \in X : x[v] > 0\}|}. \qquad (18)$$

Intuitively, IDF[v] assigns more weight to codewords v which
appear in fewer songs, and reduces the importance of codewords
appearing in many songs. The training set X is accordingly
represented by TF-IDF vectors. At test time, each
x ∈ X is ranked according to decreasing cosine-similarity to
the query q:

$$\cos(h_q, h_x) = \frac{h_q^{\mathsf{T}} h_x}{\|h_q\| \cdot \|h_x\|}. \qquad (19)$$
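
The TF-IDF baseline of Equations (17)-(19) can be sketched as follows. Function names and toy histograms are hypothetical, rows are songs, and the small constant added to the denominator (to guard against all-zero TF-IDF vectors) is an addition of this sketch rather than part of the method.

```python
import numpy as np

def idf_weights(H_train):
    """IDF[v] = log(n / number of training songs using codeword v); 0 for unused codewords."""
    n = H_train.shape[0]
    df = (H_train > 0).sum(axis=0)
    idf = np.zeros(H_train.shape[1])
    used = df > 0
    idf[used] = np.log(n / df[used])
    return idf

def cosine_rank(H_train, h_q, idf):
    """Rank training songs by decreasing cosine similarity of TF-IDF vectors."""
    A = H_train * idf
    b = h_q * idf
    sims = (A @ b) / (np.linalg.norm(A, axis=1) * np.linalg.norm(b) + 1e-12)
    return np.argsort(-sims, kind="stable"), sims

# Toy training set: 3 songs over 4 codewords; codeword 3 is never used.
H_train = np.array([[0.5, 0.5, 0.0, 0.0],
                    [0.5, 0.0, 0.5, 0.0],
                    [1.0, 0.0, 0.0, 0.0]])
idf = idf_weights(H_train)
order, sims = cosine_rank(H_train, np.array([0.5, 0.5, 0.0, 0.0]), idf)
```

Codeword 0 appears in every training song and therefore receives zero weight, so a song that uses only that codeword becomes indistinguishable from an empty one under this weighting.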



3) Automatic semantic tags: The proposed method relies 
on low-level descriptors to assess similarity between songs. 
Alternatively, similarity may be assessed by comparing high- 
level content descriptors in the form of semantic tags. These 
tags may include words to describe genre, instrumentation, 
emotion, etc. Because semantic annotations may not be avail- 
able for novel query songs, we restrict attention to algorithms 
which automatically predict tags given only audio content. 

In our experiments, we adapt the auto-tagging method
proposed by Turnbull et al. [51]. This method summarizes
each song by a semantic multinomial distribution (SMD) over
a vocabulary of 149 tag words. Each tag t is characterized
by a GMM p_t over ΔMFCC vectors, each of which was
trained previously on the CAL500 data set [52]. A song q

⁶Since codeword histograms are pre-normalized, there is no need to
recompute the term frequency in Equation (17).

⁷To avoid division by 0, we define IDF[v] = 0 for any codeword v which
is not used in the training set.



is summarized by a multinomial distribution s_q, where the t-th
entry is computed by the geometric mean of the likelihood of
q's ΔMFCC vectors q_i under p_t:

$$s_q[t] \propto \left( \prod_{q_i \in q} p_t(q_i) \right)^{1/|q|}. \qquad (20)$$



(Each SMD s_q is normalized to sum to 1.) The training set
X is thus described as a collection of SMDs {s_x : x ∈ X}.
At test time, X is ranked according to increasing distance
from the test query under the probability product kernel⁸
as described in Section III-C. This representation is also
amenable to optimization by MLR, and we will compare to
retrieval performance after optimizing PPK representations of
SMDs with MLR.
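
Equation (20) and the PPK distance between SMDs can be sketched as below. The computation is carried out in the log domain, since products of many likelihoods underflow quickly. The PPK distance is assumed to be the kernel-induced distance sqrt(k(s, s) − 2k(s, t) + k(t, t)) with k the kernel of Section III-C; function names and toy likelihoods are hypothetical.

```python
import numpy as np

def semantic_multinomial(log_lik):
    """SMD from log_lik[t, i] = log p_t(q_i): normalized geometric-mean likelihood per tag."""
    avg = log_lik.mean(axis=1)     # log of the geometric mean over frames
    avg = avg - avg.max()          # stabilize before exponentiating
    s = np.exp(avg)
    return s / s.sum()

def ppk_distance(s, t):
    """Distance induced by the probability product kernel between two multinomials."""
    k = np.sum(np.sqrt(s * t))     # k(s, t); k(s, s) = 1 for normalized s
    return np.sqrt(max(2.0 - 2.0 * k, 0.0))

# Toy example: 3 tags, a song with 2 frames (hypothetical likelihood values).
log_lik = np.log(np.array([[0.2, 0.8],
                           [0.4, 0.4],
                           [0.1, 0.1]]))
s_q = semantic_multinomial(log_lik)
d_self = ppk_distance(s_q, s_q)
d_max = ppk_distance(np.array([1.0, 0.0, 0.0]), np.array([0.0, 1.0, 0.0]))
```

The first two tags have the same geometric mean (0.4) despite different per-frame likelihoods, so they receive equal weight in the SMD.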

4) Human tags: Our final comparison uses semantic annotations
manually produced by humans, and may therefore be
interpreted as an upper bound on how well we may expect
content-based methods to perform. Each song in CAL10K
includes a partially observed, binary annotation vector over
a vocabulary of 1053 tags from the Music Genome Project⁹.
The annotation vectors are weak in the sense that a 1 indicates
that the tag applies, while a 0 indicates only that the tag may
not apply.

In our experiments, we observed the best performance by 
using cosine similarity as the retrieval function, although we 
also tested TF-IDF and Euclidean distances. As in the auto-tag 
case, we will also compare to tag vectors after optimization 
by MLR. When training with MLR, annotation vectors were
compressed via PCA to capture 95% of the training set
variance.

V. Results 

A. Vector quantization

In a first series of experiments, we evaluate various ap- 
proaches and configurations based on VQ codeword his- 
tograms. Figure 5 lists the AUC achieved by four different 
approaches (Native, TF-IDF, MLR, PPK-MLR), based on VQ 
codeword histograms, for each of four codebook sizes and 
each of four quantization thresholds. We observe that using 
Euclidean distance on raw codeword histograms¹⁰ (Native)
yields significantly higher performance for codebooks of
intermediate size (512 or 1024) than for small (256) or large
(2048) codebooks. For the 1024 codebook, increasing τ results
in significant gains in performance, but it does not exceed the 
performance for the 512 codebook. The decrease in accuracy 
for |V| = 2048 suggests that performance is indeed sensitive 
to overly large codebooks. 

After learning an optimal distance metric with MLR on raw
histograms (i.e., not PPK representations) (MLR), we observe

⁸We also experimented with χ²-distance, ℓ₁, Euclidean, and (symmetrized)
KL divergence, but PPK distance was always statistically equivalent to the
best-performing distance.

⁹http://www.pandora.com/mgp.shtml

¹⁰For clarity, we omit the performance curves for native Euclidean distance
on PPK representations, as they do not differ significantly from the Native
curves shown.
Fig. 5. Retrieval accuracy with vector quantized ΔMFCC representations.
Each grouping corresponds to a different codebook size |V| ∈
{256, 512, 1024, 2048}. Each point within a group corresponds to a different
quantization threshold τ ∈ {1, 2, 4, 8}. TF-IDF refers to cosine similarity
applied to IDF-weighted VQ histograms; Native refers to Euclidean distance
on unweighted VQ histograms; MLR refers to VQ histograms after optimization
by MLR; PPK MLR refers to distances after mapping VQ histograms
into probability product kernel space and subsequently optimizing with MLR.
Error bars correspond to one standard deviation across trials.



two interesting effects. First, MLR optimization always yields 
significantly better performance than the native Euclidean 
distance. Second, performance is much less sensitive to the 
choice of codebook size and quantization threshold: all settings 
of τ for codebooks of size |V| ≥ 512 achieve
statistically equivalent performance. 

Finally, we observe the highest performance by combining 
the PPK representation with MLR optimization (PPK-MLR). 
For |V| = 1024, τ = 1, the mean AUC score improves from
0.680 ± 0.006 (Native) to 0.808 ± 0.005 (PPK-MLR). The
effects of codebook size and quantization threshold are diminished
by MLR optimization, although they are slightly more
pronounced than in the previous case without PPK. We may
then ask: does top-τ VQ provide any benefit?

Figure 6 lists the effective dimensionality (the number
of principal components necessary to capture 95% of the
training set's variance) of codeword histograms in PPK
space as a function of quantization threshold τ. Although
for the best-performing codebook size |V| = 1024, each of
τ ∈ {1, 2, 4} achieves statistically equivalent performance,
the effective dimensionality varies from 253.1 ± 6.0 (τ = 1)
to 106.6 ± 3.3 (τ = 4). Thus, top-τ VQ can be applied to
dramatically reduce the dimensionality of VQ representations,
which in turn reduces the number of parameters learned by
MLR, and therefore improves the efficiency of learning and
retrieval, without significantly degrading performance.
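
To give a rough sense of the parameter savings, the following back-of-the-envelope sketch counts the free parameters of a symmetric d × d metric W at the two reported effective dimensionalities, rounded to integers. It assumes that MLR learns a full symmetric matrix over the PCA-compressed representation; the function name is hypothetical.

```python
def metric_parameters(d):
    """Number of free parameters in a symmetric d x d matrix."""
    return d * (d + 1) // 2

d_tau1, d_tau4 = 253, 107                 # rounded from 253.1 and 106.6
params_tau1 = metric_parameters(d_tau1)   # 32131
params_tau4 = metric_parameters(d_tau4)   # 5778
reduction = params_tau1 / params_tau4     # roughly a 5.6-fold reduction
```

Under this assumption, moving from τ = 1 to τ = 4 shrinks the metric by more than a factor of five while retrieval accuracy remains statistically equivalent.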

B. Qualitative results

Figure 7 illustrates an example optimized similarity space 
produced by MLR on PPK histogram representations, as 
visualized in two dimensions by t-SNE [53]. Even though the 
algorithm is never exposed to any explicit semantic informa- 
tion, the optimized space does exhibit regions which seem to 
capture intuitive notions of genre, such as hip-hop, metal, and 
classical. 

Table II illustrates a few example queries and their top-5
closest results under the Euclidean and MLR-optimized



Fig. 6. The effective dimensionality of codeword histograms in PPK space,
i.e., the number of principal components necessary to capture 95% of the
training set's variance, as a function of the quantization threshold τ. (The
results reported in the figure are the average effective dimension ± one
standard deviation across trials.)



metric. The native space seems to capture similarities due to 
energy and instrumentation, but does not necessarily match CF 
similarity. The optimized space captures aspects of the audio 
data which correspond to CF similarity, and produces playlists 
with more relevant results. 

C. Comparison

Figure 5 lists the accuracy achieved by using TF-IDF
weighting on codeword histograms. For all VQ configurations
(i.e., for each codebook size and quantization threshold), TF-IDF
significantly degrades performance compared to MLR-based
methods, which indicates that inverse document frequency
may not be as accurate a predictor of salience in
codeword histograms as in natural language [49].

Figure 8 shows the performance of all other methods 
against which we compare. First, we observe that raw SMD 
representations provide more accurate retrieval than both the 
GMM approach and raw VQ codeword histograms (i.e., prior 
to optimization by MLR). This may be expected, as previous 
studies have demonstrated superior query-by-example retrieval
performance when using semantic representations of multime- 
dia data [54], [55]. 

Moreover, SMD and VQ can be optimized by MLR to 
achieve significantly higher performance than raw SMD and 
VQ, respectively. The semantic representations in SMD com- 
press the original audio content to a small set of descriptive 
terms, at a higher level of abstraction. In raw form, this 
representation provides a more robust set of features, which 
improves recommendation performance compared to matching 
low-level content features that are often noisier. On the other
hand, semantic representations are inherently limited by the
choice of vocabulary and may prematurely discard important
discriminative information (e.g., subtle distinctions within
sub-genres). This renders them less attractive as a starting point
for a metric learning algorithm like MLR, compared to less- 
compressed (but possibly noisier) representations, like VQ. 
Indeed, the latter may retain more information for MLR to 
learn an appropriate similarity function. This is confirmed by 
our experiments: MLR improves VQ significantly more than 
it does for SMD. As a result, MLR-VQ outperforms all other 
Fig. 7. A t-SNE visualization of the optimized similarity space produced by PPK+MLR on one training/test split of the data (|V| = 1024, τ = 1). Close-ups
on three peripheral regions reveal hip-hop (upper-right), metal (lower-left), and classical (lower-right) genres.



TABLE II
Example playlists generated by 5-nearest (training) neighbors of three different query (test) songs using Euclidean
distance on raw codeword histograms, VQ (Native), and MLR-optimized PPK distances, VQ (PPK+MLR). Relevant results are indicated by ►.

Test query: Ornette Coleman - Africa is the Mirror of All Colors

  VQ (Native)                                       VQ (PPK+MLR)
  Judas Priest - You've Got Another Thing Comin'    Wynton Marsalis - Caravan
  Def Leppard - Rock of Ages                        ►Dizzy Gillespie - Dizzy's Blues
  KC & The Sunshine Band - Give it Up               ►Michael Brecker - Two Blocks from the Edge
  Wynton Marsalis - Caravan                         ►Eric Dolphy - Miss Ann (live)
  Ringo Starr - It Don't Come Easy                  Ramsey Lewis - Here Comes Santa Claus

Test query: Fats Waller - Winter Weather

  VQ (Native)                                       VQ (PPK+MLR)
  ►Dizzy Gillespie - She's Funny that Way           Chet Atkins - In the Mood
  Enrique Morente - Solea                           ►Charlie Parker - What Is This Thing Called Love?
  Chet Atkins - In the Mood                         ►Bud Powell - Oblivion
  Rachmaninov - Piano Concerto #4 in Gmin           ►Bob Wills & His Texas Playboys - Lyla Lou
  Eluvium - Radio Ballet                            ►Bob Wills & His Texas Playboys - Sittin' On Top Of The World

Test query: The Ramones - Go Mental

  VQ (Native)                                       VQ (PPK+MLR)
  Def Leppard - Promises                            ►The Buzzcocks - Harmony In My Head
  ►The Buzzcocks - Harmony In My Head               Motley Crue - Same Ol' Situation
  Los Lonely Boys - Roses                           ►The Offspring - Gotta Get Away
  Wolfmother - Colossal                             ►The Misfits - Skulls
  Judas Priest - Diamonds and Rust (live)           ►AC/DC - Who Made Who (live)



content-based methods in our experiments. 

Finally, we provide an estimate of an upper bound on 
what can be achieved by automatic, content-based methods, 
by evaluating the retrieval performance when using manual 
annotations (Tag in Figure 8): 0.834 ± 0.005 with cosine 
similarity, and 0.907 ± 0.008 with MLR-optimized similarity.
The improvement in accuracy for human tags, when using
MLR, indicates that even hand-crafted annotations can be
improved by learning an optimal distance over tag vectors. By
contrast, TF-IDF on human tag vectors decreases performance
to 0.771 ± 0.004, indicating that IDF does not accurately model
(binary) tag salience. The gap in performance between content- 
based methods and manual annotations suggests that there 
is still room for improvement. Closing this gap may require 



incorporating more complex features to capture rhythmic and 
structural properties of music which are discarded by the 
simple timbral descriptors used here. 

VI. Conclusion 

In this article, we have proposed a method for improving 
content-based audio similarity by learning from a sample of 
collaborative filter data. Collaborative filters form the basis of 
state-of-the-art recommendation systems, but cannot directly 
form recommendations or answer queries for items which 
have not yet been consumed or rated. By optimizing content- 
based similarity from a collaborative filter, we provide a 
simple mechanism for alleviating the cold-start problem and 
Fig. 8. Comparison of VQ-based retrieval accuracy to competing methods.
VQ corresponds to a codebook of size |V| = 1024 with quantization
threshold τ = 1. Tag-based methods (red) use human annotations, and are
not automatically derived from audio content. Error bars correspond to one
standard deviation across trials.

extending music recommendation to novel or less known 
songs. 

By using implicit feedback in the form of user listening 
history, we can efficiently collect high-quality training data
without active user participation, and as a result, train on larger 
collections of music than would be practical with explicit 
feedback or survey data. Our notion of similarity derives from 
user activity in a bottom-up fashion, and obviates the need for 
coarse simplifications such as genre or artist agreement. 

Our proposed top-τ VQ audio representation enables efficient
and compact description of the acoustic content of music
data. Combining this audio representation with an optimized 
distance metric yields similarity calculations which are both 
efficient to compute and substantially more accurate than 
competing content-based methods. 

While in this work, our focus remains on music rec- 
ommendation applications, the proposed methods are quite 
general, and may apply to a wide variety of applications 
involving content-based similarity, such as nearest-neighbor 
classification of audio signals. 